[Frontend] Review fixes: floor/mod guard, decompose, graph-copy, vcix guards + SDPA fused-matmul/exp by YWHyuk · Pull Request #264 · PSAL-POSTECH/PyTorchSim

YWHyuk · 2026-06-18T05:25:21Z

Fixes from the max-effort code review of #259, each verified against the C++ reference and reachability. Squashed/folded with the former #265 (SDPA vcix fix).

Correctness

get_dma_info: raise on unlinearized floor/mod (store-side ModularIndexing, reduction-axis floor/mod) instead of silently mis-striding ([A]).
decompose_transfer: pick the non-unit dim per collapse group (was a trailing unit dim); >4D peel raises when the vlane axis is peeled into the outer loop ([B][C]).
graph_copy: per-dim max extent for the consumer iteration shape (was rank-only, an order-dependent silent miss of an incompatible-radix conflict) ([I]).
lower_to_vcix _lower_matmul: lower a fused matmul whose operand is produced in-place by affine.vector_store (SDPA scores·V), not just DMAed -- was left un-lowered -> wrong attention output. Mirrors the C++ isAInitialized/isBInitialized.
lower_to_vcix _make_sf_vc_v_iv: fix the ExtractStridedSliceOp arg order/types so the n>1 transcendental chunk (SDPA softmax exp) stops crashing.

Hardening / parity / docs

lower_to_vcix: F16/BF16 sew=0 + rank!=1 guard (C++ parity, fp16 transcendentals stay on -convert-math-to-llvm) ([E]); matmul M/N/K-multiple-of-SS and A/B subtileK guards as loud failures ([F]).
graph_copy default-on docstrings corrected ([H]).

Refuted (no change): dma_fine_grained bias tag map (Python writer/wait mutually consistent), build_tog loop-stride dedup (no duplicate keys emitted).

Validated end-to-end (Spike+TOGSim allclose): elementwise, gemm, bmm, conv2d, group_conv, pool, cat, floor/mod suite (incl. pixel_shuffle), reduce, softmax, layernorm, batchnorm, gqa, SDPA (was crash/wrong) -- all pass. SDPA bisected: C++ vcix passes, Python vcix did not; localized to _lower_matmul.

…ases Three fixes from the max-effort review of this branch: - get_dma_info: after retiring the floor/mod recompile branches, a residual floor/mod (store-side ModularIndexing, reduction-axis floor/mod, incompatible radix) that axis-split/graph-copy did not linearize was silently bucketed by its base symbol in the dram_stride loop, emitting a wrong DRAM descriptor. Raise NotImplementedError instead of mis-striding silently. No test triggers it (0 floor/mod reach get_dma_info in the suite) -- it is a safety net. - decompose_transfer collapse fast path: keep=[g[-1]] picked the last dim of each reassociation group, which is a unit dim when trailing unit dims attach after the non-unit one (e.g. [..,4,1,1]); strides/subtile were read from the wrong axis. Pick the non-unit dim in each group. - decompose_transfer >4D peel: new_vlane fell back to 0 whenever the vlane split axis was not among the inner 4 dims, conflating peeled-into-the-outer-loop (genuinely unrepresentable -> raise) with a unit lane axis (default 0 is fine). Validated: elementwise, gemm, conv2d, cat, floor/mod suite (incl. pixel_shuffle >4D peel), softmax, layernorm, batchnorm -- all pass, no spurious raise. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

More fixes from the max-effort review, after verifying each against the C++ reference and reachability: - graph_copy _relayout_args: ranges picked the consumer iteration shape by rank alone (max key=len), so for two equal-rank operands with different per-dim extents the broadcast-from operand's smaller shape could win and the real incompatible-radix conflict on the broadcast-to dim was missed (order-dependent: a commutative reorder flipped correct relayout into a silent miss). Use per-dim max extent over the max-rank operands. - lower_to_vcix _sew/_legalize_vector_type: mirror the C++ legalizeVectorType -- F16/BF16 return sew 0 (transcendentals stay unlowered for -convert-math-to-llvm, as in the validated path) instead of being lowered to VCIX, and add the missing rank != 1 guard. - lower_to_vcix matmul: port the C++ guards as loud failures -- M/N/K must be a multiple of the systolic size when > SS (else the N//SS / K//SS loops drop the tail tile), and A vs B must agree on the K subtile (last-writer-wins would pick one silently). Latent today (heuristic/autotune only emit SS-multiple tiles). - Doc-only: graph-copy is default-on (TORCHSIM_GRAPH_COPY=0 to disable); fixed the two stale 'no-op unless set' comments. Validated: elementwise, gemm, bmm, conv2d, group_conv, cat, floor/mod suite, softmax, layernorm -- all pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…fix exp chunk Two fixes to the C++->Python vcix port (lower_to_vcix.py) that SDPA exercises but the gemm/bmm/conv tests do not: - _lower_matmul bailed with 'if ATag is None or BTag is None: return False', gating on an MVIN dma_start tag for both operands. In SDPA's fused scores.V matmul, operand B is the softmax output produced in place by affine.vector_store, not DMAed, so BTag stayed None and the matmul was left un-lowered -> wrong attention output. Mirror the C++ MatmulOpLowering: an operand is initialized by either a dma_start OR a preceding affine.vector_store into its root memref; bail only when an operand is truly uninitialized. BTag/BAsync stay None/0 and are only read under 'if BAsync:', so the B dma_wait is correctly skipped (as in C++). - _make_sf_vc_v_iv n>1 transcendental chunking called vector.ExtractStridedSliceOp(offsets, sizes, strides, vec) -- wrong arg order, missing the result type and vector operand, raising TypeError under these MLIR bindings. Pass (result=legal_ty, vector=vec, offsets, sizes, strides). Only reached by large transcendentals (n>1), e.g. SDPA softmax exp, so CI's small-tile (n==1) tests never hit it. Validated end-to-end (Spike+TOGSim allclose): SDPA 56 cases pass (was crash/wrong); matmul/bmm/conv2d regress clean. Bisected: C++ vcix passes SDPA, Python vcix did not; exp chunking and fine-grained ruled out separately. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

YWHyuk and others added 3 commits June 18, 2026 14:25

YWHyuk mentioned this pull request Jun 18, 2026

[Frontend] vcix: lower fused matmul whose operand is vector_store'd, fix exp chunking #265

Closed

YWHyuk changed the title ~~[Frontend] Fix floor/mod guard + decompose unit-dim/vlane-axis edge cases~~ [Frontend] Review fixes: floor/mod guard, decompose, graph-copy, vcix guards + SDPA fused-matmul/exp Jun 18, 2026

YWHyuk merged commit 417e4f2 into feature/tog-python-binding Jun 18, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Frontend] Review fixes: floor/mod guard, decompose, graph-copy, vcix guards + SDPA fused-matmul/exp#264

[Frontend] Review fixes: floor/mod guard, decompose, graph-copy, vcix guards + SDPA fused-matmul/exp#264
YWHyuk merged 3 commits into
feature/tog-python-bindingfrom
fix/review-floormod-decompose

YWHyuk commented Jun 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

YWHyuk commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

YWHyuk commented Jun 18, 2026 •

edited

Loading